Abstract —Widespread use of the Internet and social
networks drives the generation of big data, which is
proving to be useful in a number of applications. To deal
with explosively growing amounts of data, data analytics
has emerged as a critical technology related to computing,
signal processing, and information networking. In this
paper, a formalism is considered in which data is modeled
as a generalized social network, and communication theory
and information theory are thereby extended to data
analytics.
First, the creation of an equalizer to optimize information 
transfer between two data variables is considered, and 
financial data is used to demonstrate the advantages. Then, 
an information coupling approach based on information 
geometry is applied for dimensionality reduction, with a 
pattern recognition example to illustrate the effectiveness. 
These initial trials suggest the potential of communication 
theoretic data analytics for a wide range of applications. 

Index Terms —big data, social networks, data analysis, 
communication theory, information theory, information 
coupling, equalization, information fusion, data mining, 
knowledge discovery, information centric processing 


I. Introduction

With the rapid growth of the Internet and mobile
communications, (big) data analytics has emerged as a
critical technology, adopting techniques such as machine
learning and graphical models to mine desirable information
for a wide array of information communication technology
(ICT) applications [1]-[6]. Data mining, or knowledge
discovery in databases (KDD), has been used as a synonym
for analytics on computer-generated data. Data analytics
involves several major steps: (i) based on the selection
of data set(s), pre-processing of the data for effective
or easy computation; (ii) processing of data or data
mining, likely adopting techniques from statistical
inference and artificial intelligence; and (iii)
post-processing to appropriately interpret results of data
analytics as knowledge. Knowledge discovery aims at either
verification of user hypotheses or prediction of future
patterns from data/observations. Statistical methodologies
deal with uncertain or nondeterministic reasoning and thus
models, and are the focus of this paper. Machine learning
and artificial intelligence are useful for analyzing data,
e.g., [6], [7], typically via regression and/or
classification. With advances in supervised and
unsupervised learning, inferring the structure of
knowledge, such as inferring Bayesian network structure
from data, is one of the most useful information
technologies [8]. In recent decades, considerable research
effort has been devoted to various aspects of data mining
and data analysis, but effective data analytics are still
needed to address the explosively growing amount of data
resulting from Internet, mobile, and social networks.

A core technological direction in data analytics lies
in processing high-dimensional data to obtain
low-dimensional information via computational reduction
algorithms, namely by nonlinear approaches [9], [10],
compressive sensing [11], or tensor geometric analysis
[12]. In spite of remarkable achievements, with the
exponential growth in data volume, it is very desirable to
develop more effective approaches to deal with existing
challenges, including effective algorithms for scalable
computation, complexity and data size, outlier detection
and prediction, etc. Furthermore, modern wireless
communication systems and networks to support mobile
Internet and Internet of Things (IoT) applications require
effective transport of information, while appropriate data
analytics enable communication spectral efficiency via
proper context-aware computation [13]. The technological
challenge for data analytics due to very large numbers
of devices and data volume remains on the list of the
most necessary avenues of inquiry. At this time,
state-of-the-art data analytics primarily deal with data
processing through computational models and techniques,
such as deep learning [14]. Little effort has been made to
examine the mechanism of data generation [13] and the
subsequent relationships among data, which motivates the
investigation of data analytics by leveraging communication
theory and information theory in this paper.

As indicated in [13] and other works, relationships
among data can be viewed as a type of generalized social
network. The data variables can be treated as nodes in
a network and their corresponding relationships can be
considered to be links governed by random variables (or
random processes by considering a time index). Such
scenarios are particularly relevant for today's interactive
Internet data from or related to social networks, social
media, collective behavior from crowds, and sensed data
collected from sensors in cyber-physical systems (CPS)
or the Internet of Things. Therefore, with the aid of this
generalized social network concept, we propose a new
communication theoretic methodology of information-centric
processing for (big) data analytics in this paper.
Furthermore, by generalizing the scenario from a
communication link into a communication network, we
may use ideas from network information theory and
information geometry to develop a novel technology
known as information coupling [15], which suggests a
new information-centric approach to extraction of
low-dimensional information from high-dimensional data
based on information transfer. These technological
opportunities describe a complete communication theoretic
view of data analytics.

The rest of this paper is organized as follows. Section
II presents our modeling of data into a generalized social
network and its resemblance to typical communication
system models. Section III describes the setting of our
proposed communication theoretic data analytics to more
effectively process the data, using financial data to
illustrate the processing methodology with comparisons
to well-known techniques. Related literature is reviewed
to better explain our proposed methodology. Section IV
briefly introduces the rationale from information geometry
and the principle of information coupling to realize
dimensionality reduction, with an image pattern recognition
example to show the effectiveness of this new idea
based on network information theory. Finally, we make
concluding remarks in Section V, with suggested open
problems to fully understand and to most effectively
revisit data mining and knowledge discovery in (big) data
analytics. In addition to its potential for creating new
methods for data analytics, this new application scenario
also creates a new dimension for communication and
information theory.

II. Social Network Modeling of Data 

Fig. 1. Graphical model of network variables for a large data set.

As noted in [13], entities (e.g. data) with relationships
can be viewed as social networks, and thus social network
analysis and statistical communication theory share
commonalities in many cases. For data analytics, it is
common to face a situation in which we have two related
variables, say X and Y. When there exists uncertainty
in observing these two variables, it is common to model
them as random variables. If a time index is involved,
say the variables are observed or sampled in sequence,
these two variables are actually two random processes.
Consequently, each sequence of data drawn from a variable
is actually a sample path (i.e. sampled data) of the
random process. An intuitive way of examining the
relationship between the two processes is to compute the
correlation coefficient between these two sampled data
sequences.
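
As a small illustration of this baseline, the sketch below computes the sample (Pearson) correlation coefficient between two sampled data sequences. It assumes NumPy is available; the function name `correlation_coefficient` and the synthetic sample paths are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Sample Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D sequences of equal length")
    xc = x - x.mean()          # remove the sample means
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return float(np.sum(xc * yc) / denom)

# Hypothetical sample paths: y is a noisy linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
rho = correlation_coefficient(x, y)
```

A value of `rho` near 1 or -1 indicates a strong linear relationship; it does not by itself capture delayed or nonlinear dependence between the two processes.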

For big data problems in an Internet setting, we
often face a substantial number of variables, on the order
of thousands or even millions, and therefore must
rely on machine learning to handle such scenarios to
predict or otherwise to infer from data. One of the
key problems is to identify low-dimensional information
from high-dimensional data, as a key issue of knowledge
discovery. Recently, another vision, known as small data,
has emerged to more precisely deal with variables of
data on a human scale [16]. Therefore, in data analytics,
whether addressing big data or small data, effective and
precise inference from data is always the fundamental
challenge. An approach different from machine learning
arises by extracting embedded information from data.
More precisely, for example, we may identify the
information flowing from variable X to variable Y just as
in a typical point-to-point digital communication system.

Unfortunately, real-world problems are much more
complicated than a single communication link, and there
are many more variables involved. Figure 1 depicts the
social network structure of a large data set through
realization of graphical models and Bayesian networks,
where each node represents a data variable
(actually a vector of data) and each link represents the
relationship and causality between two data variables.
Such relationships between two data variables usually
exhibit uncertainty due to the nature of data or imperfect
observations, and thus require probabilistic modeling.
Even more challenging, such causal relationships among
large numbers of variables may not be known, and thus
a challenge is to determine or to learn the knowledge
discovery structure.

Fig. 2. Communication theory in signal flow or graphical model to
show causal relationship in data variables.

The social network analysis of data can be performed
in different ways, such as using graphical models with
machine learning techniques. However, as noted,
such widely applied methodologies focus on data
processing and inference, rather than considering
information transfer. Communications can be generally
viewed as transmission of information from one location to
another location as illustrated in Figure 2(a). We may
further use signal flow of random variables to abstractly
portray such a system as in Figure 2(b). The channel, as a
link between random variables X and Y, can be
characterized by the conditional probability distribution
f(y|x). When a channel involves multiple intermediate
variables relating X and Y, this results in receive
diversity as shown in Figure 2(c).
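
To make the channel view concrete, the following sketch estimates a discrete version of f(y|x) from paired observations by counting. It is only an illustration under stated assumptions: both variables are assumed to be already quantized to integer labels, NumPy is assumed available, and the function name `empirical_channel` and the toy data are hypothetical.

```python
import numpy as np

def empirical_channel(x, y, n_x, n_y):
    """Estimate P(Y = j | X = i) from paired integer samples.

    Rows for which no sample of X was observed are set to a uniform
    distribution (an arbitrary choice made for this sketch).
    """
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    counts = np.zeros((n_x, n_y))
    np.add.at(counts, (x, y), 1.0)               # joint histogram of (x, y)
    row_sums = counts.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)  # avoid division by zero
    return np.where(row_sums > 0, counts / safe, 1.0 / n_y)

# Toy paired observations with X in {0, 1, 2} and Y in {0, 1}.
W = empirical_channel([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], n_x=3, n_y=2)
```

Each row of `W` is one conditional distribution, so the matrix plays the role of the channel transition matrix between the two data variables.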

More advanced communication theory, namely multiple
access communication, has been developed in recent
decades and may be useful for Internet data analytics.
Multiuser detection (MUD), though commonly connected with
code division multiple access (CDMA), generally represents
situations in which multiple user signals (not necessarily
orthogonal) are simultaneously transmitted and then
detected over the same frequency band. In such situations,
the signal model can be described as

Y = (AR)X + N

where X is the transmitted signal vector; Y is the
received signal vector embedded in noise N; R denotes
the correlation matrix among signals used by
transmitter-receiver pairs; and A is a diagonal matrix
containing received amplitudes. The off-diagonal part of
AR results in multiple-access interference (MAI).
Similarly, a multiple antenna (MIMO) signal model can be
described mathematically as

Y = HX + N

where H is the channel gain matrix [21]. From the
similarity in mathematical structure of MUD and MIMO
systems, a duality in receiver structures can also be
observed. This is fundamentally the same information
flow structure as in data analytics, and so multiuser
communication theory provides a new tool to comprehend
information flow in general social networking
models. In [13], recommender systems are illustrated as
an example of this potential.

Fig. 3. Distributed detection for sensor networks.
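
The signal model Y = (AR)X + N can be simulated in a few lines to show how the off-diagonal part of AR produces multiple-access interference. This is a minimal sketch assuming NumPy; the three-user setting, the numerical values of R and A, the binary symbols, and the noise level are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3  # number of users (hypothetical)

# Hypothetical correlation matrix (unit diagonal) and received amplitudes.
R = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
A = np.diag([1.0, 0.8, 1.2])

X = rng.choice([-1.0, 1.0], size=K)   # transmitted symbols
N = 0.05 * rng.normal(size=K)         # additive noise
AR = A @ R
Y = AR @ X + N                        # received vector, Y = (AR)X + N

# Split the noiseless part into desired signal and interference.
desired = np.diag(AR) * X                     # diagonal part of AR
mai = (AR - np.diag(np.diag(AR))) @ X         # off-diagonal part: MAI
```

By construction `Y` equals `desired + mai + N`, so the `mai` term is exactly the cross-talk that a multiuser receiver has to suppress.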

In this paper, we will delineate the connection between
data analytics and social network analysis, considering a
link in a generalized social network to be equivalent to a
communication link. Consequently, we can leverage the
knowledge of communication theory to investigate data
analytics, a process we term communication theoretic
data analytics. Processing on graphs to extract motif
information aims at alleviating the complexity of data
analytics [22], and bears somewhat the same spirit as
communication theoretic data analytics. Section III
will introduce the optimal receiver to tackle nonlinear
distortion by using an equalizer, and thereby to optimize
information transfer for more effective data analytics. A
further interesting communication model is the sensor
network illustrated in Figure 3, where an information
source is detected by multiple sensors that send their
measurements to a fusion center for decisions [23]. The
number of sensors might be large but the actual
information source might be simple. Directly processing
sensor data might be a big data problem, but a single
source may induce simplification by considering information
transfer. This alternatively suggests an information
theoretic formulation of information coupling toward
critical dimension reduction in data analytics, which will
be introduced in Section IV.

III. Communication Theoretic Implementation
of Data Analytics and Applications

Using the communication theoretic setup, we will
demonstrate how to deal with practical data analytics.
To infer useful results from big data, we assume that
some knowledge can be acquired, say the conditional
probability structure f(y_n | x_m) in a general social
network modeling of big data as in Figure 1. Through
communication theory, we may treat X_m as an information
transmitter and Y_n as the information receiver. In our
setting, f(y_n | x_m) represents the communication
channel, which can be corrupted by noise due to imperfect
sampling, noisy observation, and possible interference
from unknown variables or outliers. This point-to-point
scenario is well studied in communication theory.

Fig. 4. Equalizer structure to extract the causal relationship between
two variables, from X to Y.

A common and most fundamental scenario in (big)
data analytics is to find the relationship between two
variables, X and Y, where each variable represents a
set or a series of data. Let us borrow the concept
of Bayesian networks or graphical models in machine
learning. We may subsequently infer the realization y of
Y based on the observation of the realization x of X.
Suppose there exists a causal relationship X → Y, with
uncertainty (or embedded in an unknown mechanism
allowing information transfer). As noted above, this
causal relationship can be represented by the conditional
or transition probability f(y|x), which can be considered
as a communication channel. Direct computation and
inference therefore proceed based on this structure. If
we intend to infer the knowledge structure such as this
simple causal relationship, machine learning on (big)
data becomes a principal problem.

A. Equalizer to Optimize Information Transfer 

Again, our view of this simple causal relationship
considers information transfer from X to Y as a
communication channel. In this context, if we want to
determine the causal relationship of data due to
information transfer, according to communication theory, we
should establish an equalizer for the channel to better
receive such information. The equalizer in data analytics
is most appropriately an adaptive equalizer, with
earlier observed data used for training to
obtain the weighting coefficients. The most challenging
task in machine learning is knowledge discovery, i.e.,
identifying the causal relationship among variables. The
communication theoretic approach supplies a new angle
from which to examine this task, and its information
theoretic insight will be investigated in Section IV via
information coupling.
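
One common way to realize an adaptive tap-delay-line equalizer is the normalized least-mean-squares (NLMS) update. The sketch below is an illustration under that assumption and is not claimed to be the exact adaptation rule used in this paper; the function name `nlms_equalizer`, the step size `mu`, and the synthetic two-tap channel are hypothetical. It assumes NumPy and that y[n] is predicted from x[n], x[n-1], ..., x[n-L+1].

```python
import numpy as np

def nlms_equalizer(x, y, length, mu=0.5, eps=1e-8):
    """Adaptive tap-delay-line filter trained sample by sample with NLMS.

    Returns the final weights and the one-step outputs y_hat, where
    y_hat[n] is computed before the weights are updated with y[n].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(length)
    y_hat = np.zeros(len(y))
    for n in range(length - 1, len(y)):
        taps = x[n - length + 1:n + 1][::-1]   # x[n], x[n-1], ..., x[n-L+1]
        y_hat[n] = w @ taps                    # filter output
        err = y[n] - y_hat[n]                  # instantaneous error
        w += mu * err * taps / (eps + taps @ taps)  # normalized LMS step
    return w, y_hat

# Synthetic check: y[n] = 0.5 x[n] - 0.3 x[n-1], noise free.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.convolve(x, [0.5, -0.3])[:len(x)]
w, y_hat = nlms_equalizer(x, y, length=2)
```

Here the earlier samples act as training data, and the same update can keep running during inference if online adaptation of the weighting coefficients is desired.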

As is well known in communication theory, an equalizer
serves as a part of an optimal receiver to detect a
transmitted signal from a noisy channel distorted by
nonlinearities and other impairments. In order to infer
from another data variable or time series, we may take
advantage of the same concept to process the data.
Proper implementation of an equalizer could enhance
the causal relationship between data variables and time
series, and thus allow better judgement or utilization of
the causal relationship. This is one core issue in
knowledge discovery of data analytics. In big data
analytics, this knowledge problem has a very large number
of variables. As in Figure 1, we focus on the problem
of identifying the causal relationships between the set
of variables X_1, X_2, ..., X_M and the set of variables
Y_1, Y_2, ..., Y_N, specified by appropriate weights. This
is identical to the following multiple access communication
problem:

(X_1, X_2, \ldots, X_M) [h_{ij}]_{M \times N} = (Y_1, Y_2, \ldots, Y_N)    (1)

where [h_{ij}] = H is analogous to the channel matrix. This
knowledge discovery problem in data analytics is thus
equivalent to a blind channel estimation/identification
problem in multiple access communication. Since a
feedback channel may not exist in general, this is a blind
problem. However, for online processing, equivalent
feedback may be possible, which we leave as an
open issue. We start from some simple cases to illustrate
this idea.

• Information diversity: A variable X may influence
a number of variables, say Y_1, Y_2, ..., Y_N, which is
a form of information diversity. Identifying a causal
relationship between X and Y_n is precisely a
multi-channel estimation problem such as arises in
wireless communication. Since feedback in causal
data relationships is generally impossible, such a
class of problems falls under the category of blind
multi-channel estimation/identification [24], [25].

• Information fusion: Another class to consider
is the causal relationship from many data variables
influencing a single data variable, say
X_1, X_2, ..., X_M to Y. This corresponds to
multi-input-single-output channel estimation or
identification, which is a rather overlooked subject.
However, another similar problem, source separation, has
been well studied.

B. Applications to Inference on Financial Time Series 

A useful way to demonstrate our analytical methodology
is to consider financial time series data, which
has been well studied in the literature. The purpose
is to demonstrate the prediction of stock prices from
other factors. In this example, we are trying to predict
the stock price of Taiwan Semiconductor Manufacturing
Corp. (TSMC), which is the world's largest contract
chip fabrication company. To demonstrate information
transfer and thus communication theoretic data analytics,
we consider two factors, which appear to be somewhat
indirectly related to stock prices in Taiwan but are
potentially influential: the exchange rate between US
dollars (USD) and New Taiwan Dollars (NTD, the local
currency in Taiwan), and the NASDAQ index, which
primarily consists of high-tech stocks in the US.

Fig. 5. Prediction of TSMC stock price by (a) multivariate linear regression, and (b) multivariate Bayesian estimation, where blue dots denote
actual prices and red crosses denote prices inferred from the exchange rate and NASDAQ index.

Let the time series of the exchange rate between USD
and NTD be X_1[n], n = 0, 1, ... and the time series of
the NASDAQ index be X_2[n], n = 0, 1, .... The time
series for the stock price of TSMC is Y[n], n = 0, 1, ....
Now, both X_1[n] and X_2[n] may influence Y[n]. This
classical problem has typically been handled by
multivariate time series analysis, which serves as a
benchmark without introducing more advanced techniques.

Now we treat this information fusion problem as a
two-channel information transfer (and thus communication)
problem: X_1[n] → Y[n] and X_2[n] → Y[n].

To proceed, we establish an equalizer to filter the data
in each channel. The equalizer is typically of the
tap-delay-line type, where the time unit is one day since
our data uses the closing rate/index/price. The length of
this adaptive equalizer is L and the corresponding
weighting coefficients are determined via training. We
treat 2009-2013 as the training period and then infer 2014
data in an online way. The order of the adaptive equalizer
and the weighting coefficients are learned during the
training period, and they are kept and used during
inference. Our numerical experiment shows that

• Each individual factor (exchange rate or NASDAQ
index) is surprisingly useful, but neither alone is
good enough to infer the TSMC stock price. Furthermore,
via classical methods such as linear least
squares or Bayesian estimation [26], the exchange
rate appears to be a much less predictive factor than
the NASDAQ index, which is to be expected since
the NASDAQ index to a certain extent represents
high-tech stocks including TSMC.

• It is expected that multivariate statistical analysis
would help in this case. We adopt multivariate linear
regression [27] and multivariate Bayesian estimation [26]
as benchmark techniques. The inference
of TSMC stock prices from the exchange rate and
NASDAQ index is shown in Figure 5. The mean
square errors of both techniques are similar
to those obtained using only the NASDAQ index.

• We implement the tap-delay-line equalizer structure
of Figure 4 to optimize information transfer. Based
on the mean square error (MSE) criterion, we search
for the best equalizer length and corresponding
weighting coefficients, which are then used for
inference. During the inference period of 2014,
the length keeps the value from training but
the weighting coefficients are updated online. In
Figure 6, we surprisingly observe excellent
performance of inference using the exchange rate,
as effective as the NASDAQ index. This result
offers a different insight into the correlation of
data than the traditional approaches, which
suggest that the NASDAQ index
describes the TSMC stock better. Therefore, our
result demonstrates the potential of this communication
theoretic data processing methodology and of
considering information transfer. Thus,
the potential of information-centric data processing
over conventional machine learning is worth further
study.

• Similar to diversity combining in digital communication
systems, information combining in communication
theoretic data analytics potentially further
improves the performance of inference. Maximal
ratio combining (MRC) is well known to be optimal
for diversity combining based on
signal-to-interference-plus-noise ratio (SINR). Using a
concept similar to equation (1), we develop MRC
information combining to weight equalized channels
inversely proportional to the MSE in the training
period. Such weights in MRC information combining
can be updated online. Figure 7 depicts
the prediction results (red crosses) and true values
(blue dots). We calculate the mean square error and
find even better performance than using only the
exchange rate. The MSE can be lower than those
of multivariate linear regression and multivariate
Bayesian estimation.

Fig. 6. Communication theoretic data analytics using an equalizer to
optimize information transfer from (a) the exchange rate, and (b) the
NASDAQ index to the TSMC stock price.

Fig. 7. Inference via MRC information combining two equalized data
processing channels, where the length of the equalizer for the exchange
rate is 16 (days) and the length of the equalizer for the NASDAQ index
is 61 (around 3 months).

In Appendix A, we develop the following rules of
thumb for communication theoretic data analytics, which
remain subject to further enhancements.

Problem: To infer Y based on X_1, X_2, ..., X_N.
Procedure:

(1) Use an equalizer (i.e. optimal receiver) implementation
to identify causal relationships among data
variables, X_1 → Y, X_2 → Y, ..., X_N → Y, with
the corresponding MSEs according to the training
dataset(s).

(2) Select N_c data variables that transfer sufficient
information (or have sufficiently small MSEs in training)
to identify the structure of knowledge by keeping
the length and coefficients of the equalizer, or by
online update of coefficients.

(3) Conduct MRC information fusion of these N_c data
variables as in Fig. 12 to infer.
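
The three steps of this procedure can be sketched on synthetic data as follows. This is a minimal illustration under several assumptions that are not taken from the paper: each equalizer is a batch least-squares tap-delay-line filter, the length is chosen on a held-out tail of the training data, and the combining weights are the normalized inverse validation MSEs. All function names (`delay_matrix`, `fit_equalizer`, `mrc_infer`) and the synthetic variables are hypothetical.

```python
import numpy as np

def delay_matrix(x, length):
    """Rows are [x[n], x[n-1], ..., x[n-length+1]] for n >= length - 1."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x[length - 1 - k:len(x) - k] for k in range(length)])

def fit_equalizer(x, y, max_length, val_fraction=0.25):
    """Step (1): least-squares equalizer; length chosen by held-out MSE."""
    y = np.asarray(y, dtype=float)
    split = int(len(y) * (1.0 - val_fraction))
    best = None
    for length in range(1, max_length + 1):
        D, target = delay_matrix(x, length), y[length - 1:]
        n_fit = split - (length - 1)          # rows used for fitting
        w, *_ = np.linalg.lstsq(D[:n_fit], target[:n_fit], rcond=None)
        mse = float(np.mean((D[n_fit:] @ w - target[n_fit:]) ** 2))
        if best is None or mse < best[0]:
            best = (mse, length, w)
    return best  # (validation MSE, length, weights)

def mrc_infer(train_xs, train_y, test_xs, max_length=8, n_keep=2):
    """Steps (2)-(3): keep the n_keep best variables, combine with 1/MSE weights."""
    fits = [fit_equalizer(x, train_y, max_length) for x in train_xs]
    keep = sorted(range(len(fits)), key=lambda i: fits[i][0])[:n_keep]
    longest = max(fits[i][1] for i in keep)
    n_common = len(test_xs[0]) - longest + 1  # align outputs in time
    preds = np.vstack([
        (delay_matrix(test_xs[i], fits[i][1]) @ fits[i][2])[-n_common:]
        for i in keep
    ])
    weights = 1.0 / np.array([fits[i][0] for i in keep])
    weights /= weights.sum()
    return weights @ preds, keep, weights, longest

# Synthetic data: two noisy views of a common source and one unrelated variable.
rng = np.random.default_rng(0)
s = rng.normal(size=3000)
xs = [s + 0.5 * rng.normal(size=3000),
      s + 1.0 * rng.normal(size=3000),
      rng.normal(size=3000)]
y = s
train, test = slice(0, 2000), slice(2000, 3000)
y_hat, keep, weights, longest = mrc_infer(
    [x[train] for x in xs], y[train], [x[test] for x in xs])
y_true = y[test][longest - 1:]
mse = float(np.mean((y_hat - y_true) ** 2))
```

In this toy setting the unrelated third variable is discarded in step (2), and the cleaner of the two remaining channels receives the larger combining weight.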

Remark 1. The conjecture that this communication
theoretic data analytics approach delivers more desirable
performance comes from optimizing information transfer
and avoiding cross-interference among data variables
(similar to multiple access interference in multiuser
communication), while existing multivariate statistical
analysis or statistical learning multiplexes all data
variables together, resulting in multiple access
interference in data analytics. Furthermore, for each data
variable, a selected equalizer length with coefficients is
used, and then information is combined with other data
variables, to allow better matching to extract information
in this communication theoretic data analytics approach.
Each equalizer is designated to match a specific data
variable, while multivariate analysis usually deals with a
common fixed depth of observed data in processing for all
data variables. More observations may not bring in more
relevant information but rather additional
noise/interference. Although recent research suggests a
duality between time series and network graphs [28], [29],
information-centric processing of data suggested by
communication theory supplies a unique and generally
applicable view of inference, even though its extension to
more complicated network graphical relationships of data
is still open. Note that during the training period, the
computational complexity is high; however, the
computational load is rather minor in the inference stage.

Remark 2. Applying information theoretic quantities
such as mutual information and information divergence,
beyond correlation, to data analytics has been proposed in
the literature, e.g. [30], [31]. However, such efforts have
not systematically applied information-centric processing
techniques based on communication systems as suggested in
this paper. In the meantime, though we have illustrated
only one-hop network graphical inference, these methods may
be applied further for data cleaning, data filtering,
identification of important data variables for inference,
and identification of causal relationships among data
variables to support knowledge discovery in data analytics.
To fuse heterogeneous information, fuzzy logic [30] or
Markov logic [32] is usually employed. In the proposed
approach, information combining of different-depth
information transfer alternatively serves the purpose in an
effective way, while time series data mining typically
considers similarity measured in terms of distance [33].

At this point, we cannot conclude that communication
theoretic data analytics is better than multivariate
analysis and other machine learning techniques,
as many advanced techniques that are somewhat similar to
our proposed approach, such as Kalman filtering,
graphical models [34], or role discovery [35], have
not been considered in our comparison. However, via
the above example, the communication theoretic data
analytical approach indeed demonstrates its potential,
particularly as a way of fusing information, while it
seems more difficult to achieve a similar purpose by pure
multivariate analysis. Some remaining open problems for
communication theoretic data analytics are

• How to measure the information transfer from each
variable X_i → Y, i = 1, 2, ...?

• How to determine a sufficient amount of information
transfer, and thereby which variables should be
considered in data analytics?

A more realistic but complicated scenario involves
information fusion and information diversity at the same
time, which is a multiuser detection or MIMO problem.
Due to space limitations, it is not possible to explore
this idea here, in spite of its potential applicability to
different problems, such as recommender systems.
Some open issues include:

• Joint prediction of Y_1, Y_2, ..., Y_N from
X_1, X_2, ..., X_M, and associated techniques
to effectively solve this problem, such as subspace
approaches.

• Optimal MUD has NP-hard complexity, leading
to suboptimal receivers such as the de-correlating
receiver. Comparisons of such structures to
multivariate time series analysis, mathematically and
numerically, are of interest.
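
The complexity gap mentioned in the last item can be seen in a small sketch that contrasts a de-correlating receiver with an exhaustive search for the model Y = (AR)X + N. The example assumes NumPy, binary symbols in {-1, +1}, and white Gaussian noise (so that the exhaustive search reduces to least squares); the function names and the numerical values of R and A are hypothetical.

```python
import itertools
import numpy as np

def decorrelating_detector(Y, A, R):
    """Suboptimal receiver: invert AR, then take signs (polynomial cost)."""
    return np.sign(np.linalg.solve(A @ R, Y))

def exhaustive_ml_detector(Y, A, R):
    """Search all 2^K vectors in {-1, +1}^K for the least-squares fit.

    This matches maximum likelihood under white Gaussian noise and
    grows exponentially with the number of users K.
    """
    H = A @ R
    best, best_cost = None, np.inf
    for cand in itertools.product([-1.0, 1.0], repeat=H.shape[1]):
        x = np.array(cand)
        cost = float(np.sum((Y - H @ x) ** 2))
        if cost < best_cost:
            best, best_cost = x, cost
    return best

# Hypothetical three-user example, noise free for clarity.
R = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
A = np.diag([1.0, 0.8, 1.2])
X = np.array([1.0, -1.0, 1.0])
Y = A @ R @ X
x_dec = decorrelating_detector(Y, A, R)
x_ml = exhaustive_ml_detector(Y, A, R)
```

Both receivers recover the transmitted vector here; the de-correlating receiver does so with a single linear solve, at the price of noise enhancement when R is poorly conditioned.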

IV. Information Coupling 

Thus far, we have intuitively viewed communication
theoretic data analytics as being centered on information
transfer among different data variables, and then applied
receiver techniques to enhance data analytics. This
suggests that the amount of information in the data is in
fact far less than the amount of (big) data. The
methodology in Section III deals with a number of data
variables to effectively execute information processing
analogously to communication systems and multiuser
communications. The remaining challenge is to identify
low-dimensional information structure from high-dimensional
(raw) data, and to better construct intuition about
communication theoretic data analytics from information
theory. In this section, we aim to apply the recently
developed information coupling [15] to achieve this
purpose, and also provide information theoretic insights
into information-centric data processing.

A. Introduction to Information Coupling 

From the analogy between communication networks
and data analytics, intuition suggests the need for
information-centric processing in addition to data
processing, not only for optimizing communication
systems/networks but also for mining important information
from (big) data. The conventional studies of information
processing are limited to data processing, thereby
focusing on representing information as bits, and
transmitting, storing, and reconstructing these bits
reliably. To see this, let us consider a random variable
with M possible values {1, 2, ..., M}. If we know its value
reliably, then we can describe this knowledge with a single
integer, and then further process the data with this known
value. On the other hand, if the value is not
deterministically acquirable, then we need to describe our
knowledge with an M-dimensional distribution P_M(·), which
requires M - 1 real numbers to describe. Therefore, the
data processing task has to be performed in the space of
probability distributions.

When we move towards information-centric processing,
the general way to describe information processing
relies on the conditional distribution of the message,
conditioned on all the observations, at each node of
the network, e.g., P_{Y_n | X_m} in Figure 1. Conventional
information theoretic approaches working on the
distribution spaces in communication and data processing
are mostly based on coded transmission, in which the
desired messages are often quite large, which results in
extremely high dimensionality of the belief vectors.
This is in fact one of the main difficulties of shifting
data processing from data centric to information centric.
It turns out that this difficulty comes from the fact that
the distribution space itself is not a flat vector space,
but is a rather complicated manifold. Amari's work [36] on
information geometry provides a tool to study this space,
but the analysis can be quite involved in many cases.
In this section, we propose a framework that allows us
to greatly simplify this challenge. In particular, we turn
our focus to low rate information contained in the data,
which is significant for describing the data. We call such
problems information coupling problems [37], [38].

To formulate this problem mathematically, consider
a point-to-point communication scenario, where
a signal $X$ is transmitted through a channel with
transition probability $W_{Y|X}$, which can be viewed as
a $|\mathcal{Y}| \times |\mathcal{X}|$ matrix, to generate an output $Y$. In
conventional communication systems, we consider
encoding a message $U$ into the signal vector $X$, to form a
Markov relation $U \to X \to Y$. An efficient
coding scheme then aims to design both the distribution $P_U$
and the conditional distributions $P_{X|U=u}$ to maximize
the mutual information $I(U;Y)$, which corresponds to
the communication rate. Such optimization problems in
general do not have analytical solutions, and require
numerical methods such as the Blahut-Arimoto algorithm
to find the optimal value. More importantly, when we
allow coded transmissions, i.e., replace $X$ and $Y$ by
$n$ independent and identically distributed (i.i.d.) copies
of the pair, it is not clear a priori that the optimizing
solution would have any structure. Although Shannon
provided a separate proof for the point-to-point case
that the optimization of the multi-letter problem over
$P_{X^n|U}$ should also have an i.i.d. structure, failure to
generalize this proof to multi-terminal problems remains
the biggest obstacle to solving network capacity problems and
subsequently designing algorithms. In contrast, information
coupling deals with the maximization of the same
objective function $I(U;Y)$, but with an extra constraint
that the information encoded in $X$, measured by $I(U;X)$,
is small. With a slight strengthening, this constraint can
be reduced to the condition that all the conditional
distributions $P_{X|U}(\cdot|u)$, for all $u$, are close to the marginal
distribution $P_X$. We refer the reader to [37] for the
details of this strengthening. With this extra constraint,
the linear information coupling problem for the
point-to-point channel can be formulated as


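$$\max_{U \to X \to Y} \; \frac{1}{n} I(U;Y), \qquad (2)$$

$$\text{subject to: } \frac{1}{n} I(U;X) \le \delta,$$

$$\frac{1}{n} \left\| P_{X|U=u} - P_X \right\|^2 = O(\delta), \quad \forall u, \qquad (3)$$

where $\delta$ is assumed to be small.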


It turns out that the local constraint (3) in (2), which
assumes that all conditional distributions are close to the
marginal distribution, plays the critical role of reducing
the manifold structure to a linear vector space. In
addition, the optimization problem, regardless of the
dimensionality, can always be solved analytically with
essentially the same routine. To show how the
local constraint helps to simplify the problem, we first
note that, given that the conditional distributions $P_{X|U=u}$
are close to $P_X$ for all $u$ in terms of $\delta$, the mutual
information $I(U;X)$ can be approximated up to first
order as


$$I(U;X) = \delta \cdot \sum_u P_U(u) \cdot \|\psi_u\|^2 + o(\delta), \qquad (4)$$

where $\psi_u$ is the perturbation vector with the entries
$\psi_u(x) = (P_{X|U=u}(x) - P_X(x)) / \sqrt{\delta \cdot P_X(x)}$, for all
$x$. This local approximation results from the first order
Taylor expansion of the Kullback-Leibler (K-L) divergence
$D(P_{X|U=u} \| P_X)$ between $P_{X|U=u}$ and $P_X$ with
respect to (w.r.t.) $\delta$. In addition, with this approximation
technique, we can similarly express the mutual information
at the receiver end as

$$I(U;Y) = \delta \cdot \sum_u P_U(u) \cdot \|\phi_u\|^2 + o(\delta), \qquad (5)$$

where $\phi_u(y) = (P_{Y|U=u}(y) - P_Y(y)) / \sqrt{\delta \cdot P_Y(y)}$.

Now, note that $U \to X \to Y$ forms a Markov relation;
therefore both $P_{Y|U=u}$ and $P_Y$, viewed as vectors, are
the output vectors of the channel transition matrix $W_{Y|X}$
with the input vectors $P_{X|U=u}$ and $P_X$. This implies that
the vector $\phi_u$ is the output vector of a linear map $B$ with
$\psi_u$ as the input vector, where

$$B = \left[\sqrt{P_Y}\right]^{-1} W_{Y|X} \left[\sqrt{P_X}\right], \qquad (6)$$

and $[\sqrt{P_X}]$ and $[\sqrt{P_Y}]$ denote diagonal matrices with
diagonal entries $\sqrt{P_X(x)}$ and $\sqrt{P_Y(y)}$. This linear map
$B$ is called the divergence transition matrix (DTM), as
it carries the K-L divergence metric from the input
distribution space to the output distribution space.
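
As a numerical illustration of (6), the following minimal sketch builds the DTM for a small channel and checks that the output perturbation vector equals $B$ applied to the input perturbation vector. The channel `W`, the marginal `P_X`, the perturbation `d`, and all helper names are hypothetical choices made only for this illustration; they are not taken from the experiments in this paper.

```python
import math

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def kl(p, q):
    """K-L divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical channel W[y][x] = W_{Y|X}(y|x); each column sums to 1.
W = [[0.7, 0.1],
     [0.2, 0.3],
     [0.1, 0.6]]
P_X = [0.4, 0.6]      # assumed input marginal
P_Y = matvec(W, P_X)  # induced output marginal

# DTM of (6): B = [sqrt(P_Y)]^{-1} W [sqrt(P_X)].
B = [[W[y][x] * math.sqrt(P_X[x] / P_Y[y]) for x in range(len(P_X))]
     for y in range(len(P_Y))]

# A conditional distribution P_{X|U=u} close to P_X (d sums to zero).
delta = 1e-4
d = [0.003, -0.003]
P_Xu = [p + e for p, e in zip(P_X, d)]
P_Yu = matvec(W, P_Xu)

# Perturbation vectors at the input and at the output.
psi_in = [(pu - p) / math.sqrt(delta * p) for pu, p in zip(P_Xu, P_X)]
psi_out = [(pu - p) / math.sqrt(delta * p) for pu, p in zip(P_Yu, P_Y)]

# The output perturbation is exactly B applied to the input perturbation.
B_psi = matvec(B, psi_in)
max_gap = max(abs(a - b) for a, b in zip(B_psi, psi_out))

# Locally, the ratio of divergences matches the ratio of squared lengths.
kl_ratio = kl(P_Yu, P_Y) / kl(P_Xu, P_X)
norm_ratio = sum(a * a for a in psi_out) / sum(a * a for a in psi_in)
print(max_gap, kl_ratio, norm_ratio)
```

The comparison uses ratios on purpose, so that it does not depend on the normalization constant chosen for the perturbation vectors.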


We point out that with this local approximation
technique, both the input and output probability
distribution spaces are linearized as Euclidean spaces
by the tangent planes around the input and the output
distributions $P_X$ and $P_Y$. Hence, we can define a
coordinate system in both distribution spaces, with an
inner product and an orthonormal basis, as in conventional
Euclidean spaces. Under such a coordinate system,
the mutual information $I(U;X)$ becomes the Euclidean
metric of the perturbation vector $\psi_u$ averaged over
different values of $u$. Similarly, the mutual information
$I(U;Y)$ can also be viewed as the Euclidean metric
of the perturbation vector $B \cdot \psi_u$ in the output space.
Hence, the optimization problem of maximizing the
mutual information $I(U;Y)$ is turned into the following
linear algebra problem:

$$\max \; \sum_u P_U(u) \cdot \|B \cdot \psi_u\|^2, \qquad (7)$$

$$\text{subject to: } \sum_u P_U(u) \cdot \|\psi_u\|^2 = 1.$$

In particular, $U$ can without loss of optimality be
designed as a uniform binary random variable, and the
goal of (7) is to find the input perturbation vector $\psi_u$
that provides the largest output image $B \cdot \psi_u$ through the
linear map $B$. The solution of this problem then relies on
the singular value decomposition of $B$, and the optimal
$\psi_u$ corresponds to the singular vector of $B$ w.r.t. the
largest singular value. Figure 8 illustrates the geometric
intuition of linearized distribution spaces.

Fig. 8. The divergence transition matrix B serves as a linear map
between two spaces, with right and left singular vectors as orthonormal
bases. Different input singular vectors have different output lengths at
the output space.
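
One way to compute the optimal perturbation of (7) numerically is power iteration on $B^T B$; a minimal sketch follows. It makes one assumption explicit: $B$ always has the trivial singular pair ($\sqrt{P_X}$, $\sqrt{P_Y}$) with singular value 1, and a valid perturbation must be orthogonal to $\sqrt{P_X}$ so that the conditional distributions still sum to one, so the search is restricted to the orthogonal complement of $\sqrt{P_X}$. The channel, the marginal, and the function names are hypothetical.

```python
import math

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def optimal_perturbation(B, sqrt_px, iters=300):
    """Power iteration on B^T B over vectors orthogonal to sqrt(P_X)."""
    Bt = [list(col) for col in zip(*B)]
    v = [float(i + 1) for i in range(len(sqrt_px))]  # arbitrary start
    for _ in range(iters):
        c = dot(v, sqrt_px)
        v = [a - c * b for a, b in zip(v, sqrt_px)]  # drop trivial direction
        v = unit(matvec(Bt, matvec(B, v)))
    c = dot(v, sqrt_px)
    v = unit([a - c * b for a, b in zip(v, sqrt_px)])
    Bv = matvec(B, v)
    return v, math.sqrt(dot(Bv, Bv))

# Hypothetical 3x3 channel W[y][x]; each column sums to 1.
W = [[0.8, 0.1, 0.1],
     [0.1, 0.7, 0.2],
     [0.1, 0.2, 0.7]]
P_X = [0.3, 0.3, 0.4]  # assumed input marginal
P_Y = matvec(W, P_X)
B = [[W[y][x] * math.sqrt(P_X[x] / P_Y[y]) for x in range(3)]
     for y in range(3)]

sqrt_px = [math.sqrt(p) for p in P_X]
psi, sigma = optimal_perturbation(B, sqrt_px)
print(psi, sigma)  # unit-length perturbation and its output length
```

The sign of the returned vector is arbitrary, which matches the two choices of a binary $U$.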


More importantly, this information coupling framework
with the locality constraint allows us to deal with
multi-letter problems in information theory in a
systematic manner. To see this, consider the multi-letter
version of the problem (2):

$$\max \; \frac{1}{n} I(U;Y^n),$$

$$\text{subject to: } \frac{1}{n} I(U;X^n) \le \delta, \qquad (8)$$

$$\frac{1}{n} \left\| P_{X^n|U=u} - P_{X^n} \right\|^2 = O(\delta), \quad \forall u,$$

in which the message is encoded in an $n$-dimensional
signal vector $X^n$ and the optimization is over the
distribution space of $P_{X^n|U=u}$. By applying the same local
approximation technique, we can again linearize both
input and output spaces into Euclidean spaces, and the
linear map between these two spaces turns out to be
the tensor product $B^{\otimes n}$. Because the singular
vectors of $B^{\otimes n}$ are tensor products of the singular vectors
of $B$, we know that the optimal perturbation vector in
the multi-letter case has the tensor product form, and the
optimal conditional distribution $P_{X^n|U=u}$ has an i.i.d.
structure [37]. More interestingly, this approach to
multi-letter information theory problems can be
easily carried over to multi-terminal problems, in which
the information theory problems are reduced to
the corresponding linear algebra problems. In particular,
the i.i.d. structure of $P_{X^n|U=u}$ for the point-to-point
case was also observed by Shannon with an auxiliary
random variable approach; however, generalizing
the auxiliary random variable approach to multi-terminal
problems, e.g., the general broadcast channel,
remains a difficult open problem. In a nutshell,
information coupling and the local constraint help
us to reduce the manifold structure to a linear vector
space, where the optimization problem, regardless of the
dimensionality, can always be solved analytically with
essentially the same routine.
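
The tensor product structure rests on the mixed-product property $(B \otimes B)(a \otimes b) = (Ba) \otimes (Bb)$, so singular pairs of the two-letter map are tensor products of singular pairs of $B$. The sketch below checks this for $n = 2$ on a hypothetical binary channel: perturbing a single letter while leaving the other at $\sqrt{P_X}$ gives an output of the same form. All names and numbers are assumptions made for illustration.

```python
import math

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def kron_vec(a, b):
    """Tensor (Kronecker) product of two vectors."""
    return [x * y for x in a for y in b]

def kron_mat(A, C):
    """Tensor (Kronecker) product of two matrices."""
    return [[a * c for a in row_a for c in row_c]
            for row_a in A for row_c in C]

# Hypothetical binary channel W[y][x]; each column sums to 1.
W = [[0.9, 0.2],
     [0.1, 0.8]]
P_X = [0.5, 0.5]  # assumed input marginal
P_Y = matvec(W, P_X)
B = [[W[y][x] * math.sqrt(P_X[x] / P_Y[y]) for x in range(2)]
     for y in range(2)]

sqrt_px = [math.sqrt(p) for p in P_X]
sqrt_py = [math.sqrt(p) for p in P_Y]
psi = [sqrt_px[1], -sqrt_px[0]]  # unit vector orthogonal to sqrt(P_X)

# Two-letter map applied to "first letter untouched, second perturbed".
B2 = kron_mat(B, B)
lhs = matvec(B2, kron_vec(sqrt_px, psi))
# B maps sqrt(P_X) to sqrt(P_Y), so the output factorizes the same way.
rhs = kron_vec(sqrt_py, matvec(B, psi))
gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(gap)
```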

Furthermore, the information coupling formulation not
only simplifies the analysis, but also suggests a new way
of communicating over data networks, or of transferring
information over networks of data variables. Instead of trying
to aggregate all the information available at a node,
pack it into data packets, and send them through
the outgoing links, the information coupling methodology
seeks to transmit a small piece of information
at a time, riding on the existing data traffic. The
network design of data variables thus focuses on the
propagation of a single piece of message, from the
source data variable to all destination data variables.
Each node in the network only alters a small fraction
of the transmitted symbols, according to the decoded
part of this message. The analytical simplicity of
information coupling allows such transmissions to be
efficient, even in the presence of general broadcasting
and interference. Furthermore, information coupling can
be employed to obtain useful information from network
operation, as a complementary function for (wireless)
network tomography. For instance, we can analyze the
covariance matrix of received signals at the fusion center
in a sensor network to form communities, as in social
networks, such that energy-efficient transmission and
device management can be achieved.

B. Implementation of Information Coupling 

Using the information theoretic setup via information
coupling, we now demonstrate how to deal with
practical data analytics. To infer useful results from big
data, we should be able to acquire important knowledge
from the general social network model of big data.
In particular, we consider $X_1, \ldots, X_M$ as
information transmitters and $Y_1, \ldots, Y_N$ as the information
receivers. The probabilistic relationship between the $X$'s
and $Y$'s represents the communication channel, which
accounts for the effects of imperfect sampling, noisy
observation, or interference from unknown variables or
outliers. In the following, we demonstrate
the potential of extracting critical low-dimensional
information from (big) data through the information
coupling approach.

To demonstrate the idea, suppose that there is a hidden
source sequence $x^n = \{x_1, x_2, \ldots, x_n\}$, i.i.d. generated
according to some distribution $P_X$. Instead of observing
the hidden source directly, we are only allowed to
observe a sequence $y^n = \{y_1, y_2, \ldots, y_n\}$, which can
be statistically viewed as the noisy output of the source
sequence through a discrete memoryless channel $W_{Y|X}$.
Traditionally, if we want to infer the hidden source from
the noisy observation $y^n$, we would resort to a
low-dimensional sufficient statistic of $y^n$ that has all the
information one can tell about $x^n$. However, in many
cases, such a sufficient statistic might be computationally
difficult to obtain due to the high-dimensional structures
of $x^n$ and $y^n$, which turns out to be the common obstacle
in dealing with big data. Rather than seeking a useful
sufficient statistic, we turn our
focus to a statistic of $y^n$ that efficiently
describes a certain feature of $x^n$.


In particular, there are many different ways to define
the efficiency of information extraction from data. From
the information theoretic point of view, we
employ the mutual information as the measure
of information efficiency. Rigorously,
we want to acquire a binary feature $U$ of $x^n$ from the
observed data $y^n$, such that the efficiency, measured by
$I(U;Y^n)$, is maximized. In order to find such a
feature, we formulate an optimization problem that
has the same form as the linear information coupling
problem (8), and the optimal solution of (8) characterizes
which feature of $x^n$ can be most efficiently
extracted from the noisy observation $y^n$ in terms of
the mutual information metric. Therefore, from [37], we
can explicitly express the optimal solution $P_{X^n|U}$ of (8)
as the tensor product of the distribution $P_{X|U}(x) =
P_X(x) \pm \sqrt{\delta P_X(x)} \cdot \psi_X(x)$, where $\psi_X$ is the singular
vector of the DTM with the largest singular value.

Then, we want to estimate this piece of information
$U$ from the noisy observation $y^n$. For this purpose, we
apply the maximum likelihood principle, and the
log-likelihood function can be written as

$$l_i(y^n) = \log \left( \frac{P_{Y^n|U=i}(y^n)}{P_{Y^n}(y^n)} \right), \quad i = 0, 1,$$

where $P_{Y^n}$ and $P_{Y^n|U}$ are the output distributions of
the channel $W_{Y|X}$ with input distributions $P_{X^n}$ and
$P_{X^n|U}$. Then, the decision rule depends on the sign of
$l_0(y^n) - l_1(y^n)$: when it is positive, we estimate $\hat{U} = 0$;
otherwise, $\hat{U} = 1$. Now, noting that both $P_{Y^n|U}$ and $P_{Y^n}$
are product distributions, we can further simplify $l_0(y^n)$
as

$$l_0(y^n) = \log \prod_{i=1}^{n} \left( \frac{P_{Y|U=0}(y_i)}{P_Y(y_i)} \right)
= \sum_{i=1}^{n} \log \left( \frac{P_{Y|U=0}(y_i)}{P_Y(y_i)} \right)
\approx \sqrt{\delta} \sum_{i=1}^{n} \frac{\psi_Y(y_i)}{\sqrt{P_Y(y_i)}},$$

where $\psi_Y = B \cdot \psi_X$ is the corresponding output
perturbation vector, and in the last step we ignore all the higher
order terms of $\delta$. We call $\psi_Y(y)/\sqrt{P_Y(y)}$, viewed as a
function $\mathcal{Y} \to \mathbb{R}$, the score function; the empirical sum of this
function over the data $y_1, \ldots, y_n$ is the sufficient statistic
of a specific piece of information in $x^n$ that can be
most efficiently estimated from $y^n$. Figure 9 illustrates
the score function in this point-to-point setup.

Fig. 9. The score function for the noisy observations.

The score function derived from the information coupling approach
provides the maximum likelihood statistic of the most
efficiently inferable information from the data, and we
call the score function the efficient statistic of the data.
The efficient statistic of the data can be regarded as a
low-dimensional label corresponding to the most significant
information of the data, which can be employed in
further data processing tasks. In the next subsection, we
demonstrate how to apply the efficient statistic to
practical machine learning problems, and its performance,
through an image recognition example.
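
The decision rule based on the score function can be sketched as follows. The output marginal `P_Y`, the value of `delta`, the perturbation `d` (standing in for $\sqrt{\delta P_Y}\,\psi_Y$), and the function names are all hypothetical; the sketch only illustrates that the sign of the empirical sum of scores tracks the exact log-likelihood comparison when the perturbation is small.

```python
import math
import random

P_Y = [0.5, 0.3, 0.2]     # assumed output marginal
delta = 1e-2              # assumed small parameter
d = [0.05, -0.02, -0.03]  # assumed P_{Y|U=0} - P_Y; entries sum to zero
psi_Y = [di / math.sqrt(delta * p) for di, p in zip(d, P_Y)]

def score(y):
    """Score function psi_Y(y) / sqrt(P_Y(y))."""
    return psi_Y[y] / math.sqrt(P_Y[y])

def decide(ys):
    """Estimate U from the sign of the empirical sum of scores."""
    return 0 if sum(score(y) for y in ys) > 0 else 1

def exact_llr(ys):
    """l_0(y^n) - l_1(y^n), with P_{Y|U=0} = P_Y + d and P_{Y|U=1} = P_Y - d."""
    return sum(math.log((P_Y[y] + d[y]) / (P_Y[y] - d[y])) for y in ys)

# Draw observations under U = 0 and compare both decision statistics.
rng = random.Random(0)
P_Y0 = [p + di for p, di in zip(P_Y, d)]
ys = rng.choices(range(3), weights=P_Y0, k=5000)
print(decide(ys), exact_llr(ys) > 0)
```

To first order, the exact statistic equals $2\sqrt{\delta}$ times the sum of scores, so both rules share the same sign whenever the higher order terms are negligible.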

Finally, we would like to emphasize that the efficient
statistic can be useful in many machine learning
scenarios, such as image processing, network data mining, and
clustering. Consider the social network model of big
data with a very large number of nodes in the
network. In this case, acquiring a meaningful sufficient
statistic for the data is usually an intractable task due
to the complicated network structure. Moreover, even if
it is possible to specify the sufficient statistic, the
computational complexity can still be extremely high due to
the high-dimensional structure of the data. On the other
hand, the efficient statistic obtained from information
coupling provides information that, while
low-dimensional, retains the most significant information about
the original data. This is precisely the main objective of
dimension reduction or feature extraction as studied in
machine learning. Equalization in Section III
may be considered as an intuitive implementation of
information coupling in big data. In addition, in order
to acquire the efficient statistic from the data, we simply
need to solve for the score function, i.e., the optimal singular
vector, which can be computationally efficient.
Therefore, information coupling potentially
provides a new framework for efficiently processing and
analyzing big networked data.

C. Application to Dimension Reduction in Pattern 
Recognition 

Let us illustrate how the efficient statistic can be
applied to practical data processing. For demonstration
purposes, we address the image recognition task of
distinguishing the handwritten digits "1" and "2" through noisy
images, as illustrated in Figure 10(a). We consider these 2-D
images as generated from an Ising model, as shown in Figure 10(b).
Each clean pixel in Figure 10(b) is passed through one
of the parallel independent noisy observation channels
to produce a noisy image. In the abstract, we can think of the
pixels of the clean image as a collection of random
variables $X_1, X_2, \ldots, X_N$. Then, after passing through noisy
observation channels with a transition kernel $P_{Y^N|X^N}$,
the pixels of the noisy image form a collection of random
variables $Y_1, Y_2, \ldots, Y_N$.

Fig. 10. (a) Handwriting Recognition between the number “1” and “2”
via the noisy images, (b) Ising Model of Noisy Images. The pixels of
the clean image can be viewed as random variables $X_i$. After passing
through the channel, the pixels are corrupted by the noise to different
levels, and the collection of noisy pixels are the random variables $Y_i$.

Fig. 11. Experimental results for the separation efficiency with respect
to different values of e.

Now, to apply the efficient
statistic to acquire the most significant feature, we
in principle go through the following procedure:


(1) To determine the divergence transition matrix $B$,
we determine the distributions $P_{X^N}$ and $P_{Y^N}$
as well as the transition kernel $P_{Y^N|X^N}$. The
distributions $P_{X^N}$ and $P_{Y^N}$ can be learned from the
empirical distribution of the images viewed as
$N$-dimensional vectors. In addition, we design the
transition kernel $P_{Y^N|X^N}$ in this image recognition
example.

(2) Solve the singular value decomposition of $B$ and
determine the optimal left singular vector $\psi_Y$ of $B$.
Note that $\psi_Y$ has the dimensionality $|\mathcal{Y}|^N$.

(3) The efficient statistic is then specified by the
score function $\psi_Y(y^N)/\sqrt{P_{Y^N}(y^N)}$ of the data
$y^N = \{y_1, \ldots, y_N\}$, which can be obtained as the $y^N$-th
entry of the vector $\psi_Y$, divided by $\sqrt{P_{Y^N}(y^N)}$.
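
The three steps can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: images with only $N = 2$ binary pixels, a binary symmetric per-pixel channel with noise level `e = 0.1`, and power iteration in place of a full singular value decomposition. As a further assumption, the trivial singular vector $\sqrt{P_{Y^N}}$ (singular value 1) is skipped, since it carries no information about $U$.

```python
import itertools
import math
from collections import Counter

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

# Toy "images" with N = 2 binary pixels (hypothetical data).
images = [(0, 0)] * 4 + [(0, 1)] + [(1, 0)] + [(1, 1)] * 4
states = list(itertools.product([0, 1], repeat=2))

# Step (1): empirical P_{X^N}, a designed per-pixel channel, and P_{Y^N}.
counts = Counter(images)
P_X = [counts[s] / len(images) for s in states]
e = 0.1                       # assumed noise level
w = [[1 - e, e], [e, 1 - e]]  # w[y][x] for a single pixel
W = [[w[y[0]][x[0]] * w[y[1]][x[1]] for x in states] for y in states]
P_Y = matvec(W, P_X)
B = [[W[j][i] * math.sqrt(P_X[i] / P_Y[j]) for i in range(len(states))]
     for j in range(len(states))]

# Step (2): leading nontrivial left singular vector by power iteration
# on B B^T, restricted to vectors orthogonal to sqrt(P_{Y^N}).
Bt = [list(col) for col in zip(*B)]
sqrt_py = [math.sqrt(p) for p in P_Y]
psi = [1.0, 2.0, 3.0, 4.0]    # arbitrary start
for _ in range(300):
    c = dot(psi, sqrt_py)
    psi = [a - c * b for a, b in zip(psi, sqrt_py)]
    psi = unit(matvec(B, matvec(Bt, psi)))

# Step (3): score of each possible noisy image y^N.
score = {y: psi[j] / sqrt_py[j] for j, y in enumerate(states)}
print(score)
```

In this toy data set the score separates the all-0 image from the all-1 image, with the mixed images scoring near zero. For realistic image sizes the joint alphabet is far too large to enumerate like this, which is why structured algorithms are needed.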


Now, let us demonstrate the application of this
procedure to a practical image recognition problem. Here,
we employ the MNIST Database [40] of handwritten
digits as the test images, where each image is of size
19 × 19 pixels and each pixel value is scaled to have
one of four values, as in Figure 10(a). In particular, as
shown in Figure 10(a), we have a mixture of images of
handwritten digits 1 and 2, and we assume that we can
observe the noisy version of these images. Our goal is
to separate these noisy images with respect to different
digits by computing the score of these images with our
algorithm and ordering them.

To this end, we view the (clean) images as generated
from the Ising model, in which each pixel corresponds
to a node $X_i$ in the Ising model. Then, we pass each
pixel in the images independently through a discrete
memoryless channel with transition matrix


l-2e 2e e f 

e l-3e 2e | 

e 0 1 — 4e I 

0 e e 1 — e 


(9) 


Here, the transition matrix is chosen merely to facilitate 
our simulation, and e is the parameter measuring the 
noise level of the channel. After passing clean pixels 
through the channel, we observe the noisy version of 
these images, where each noisy pixel corresponds to Yi 
in this setup. Clearly, the empirical joint input and output 
distributions can be obtained by the statistics of the 
images. Then, we can apply our algorithm to compute 
the score for each noisy image, and then order these 
scores to separate images with respect to different digits. 

To measure the performance of our algorithm, we
classify a batch of $2N$ images with $N$ of 1's and $N$ of
2's. After ordering the scores, an ideal classifier should
have the $N$ lowest-scored images belonging to one
digit and the $N$ highest-scored images belonging to the other
digit. To compare with the ideal classifier, we define
the separation error probability as the proportion of the
pictures that are wrongly classified, i.e.,

$$\text{Error probability} = \frac{\#\text{ of wrongly classified pictures}}{2N}. \qquad (10)$$

The classifier is more efficient when the separation error
probability is closer to 0. For different values of e, our algorithm has
the performance shown in Figure 11. From this simulation
result, we can see that our algorithm is quite efficient in
separating images with respect to different digits. This
result indicates that the efficient statistic is in fact a very
informative way to describe stochastic observations.
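
The error measure (10) can be computed from a list of scores as in the sketch below. The function name and the toy inputs are hypothetical. One design assumption is made explicit: because the sign of a singular vector is arbitrary and the text only asks that the low-scored half belong to "one digit", the sketch reports the better of the two possible assignments.

```python
def separation_error(scores, labels):
    """Fraction of wrongly separated images, as in (10).

    scores: one score per image; labels: true digit (1 or 2) per image,
    with the same number of images for each digit.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = len(scores) // 2
    # Tentatively assign the low-scored half to digit 1, the rest to digit 2.
    wrong = (sum(labels[i] != 1 for i in order[:half])
             + sum(labels[i] != 2 for i in order[half:]))
    err = wrong / len(scores)
    # The opposite assignment is equally valid for an unsupervised score.
    return min(err, 1.0 - err)

scores = [-2.0, -1.0, 0.5, 1.5, -0.3, 0.9]  # hypothetical scores
labels = [1, 1, 2, 2, 2, 1]                 # hypothetical true digits
err = separation_error(scores, labels)
print(err)
```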

Remark 3. It might be curious at first glance that in
Figure 11 the error probability does not necessarily decay as the
noise level e decreases. This phenomenon can be
explained as follows. Note that the score function defined
in this section depends not only on the data vectors $y^N$,
but also on the designed channel transition matrix (9).
Therefore, different channel transition matrices may
provide different score functions on the noisy data vectors
$y^N$. The score function is designed
to extract the feature that can be communicated the best
through the channel, but not necessarily the best feature to
separate the two sets of images. Thus, the performance
of the image recognition may not be improved with
a less noisy channel. On the other hand, we should
understand our result as showing that, with a rather arbitrarily
designed channel transition matrix (9), we have obtained
rather good error probabilities without
any extra learning other than the
empirical distributions of the data, i.e., in a completely
unsupervised manner. Thus, our result demonstrates a new potential
of applying communication and information theory to
machine learning problems.

Remark 4. Dimension reduction is one of the central
topics in statistics, machine learning, pattern recognition,
and data mining, and has been studied intensively.
Celebrated techniques addressing this subject
include principal component analysis (PCA) [41],
K-means clustering [42], independent component analysis
(ICA) [43], and regression analysis [44], for which many
efficient algorithms have been developed.
These approaches
mainly focus on dealing with the space of the data, rather
than addressing the information flow embedded in the
data. On the other hand, recent studies have suggested
a trend towards information-centric data processing,
thus advocating the research direction of analyzing the
underlying information flow of networked data. The
information coupling approach can be considered as a
technique that aims to provide a framework to reach this
goal from the information theoretic perspective. From the
discussions in this section, we can see that information
coupling studies data analysis problems from the
angle of the distribution space rather than simply the data space.
Thus, information coupling potentially provides a fresh
view of how information can be exchanged between
different terminals in implementing data processing
tasks, which not only helps to understand
the existing approaches more deeply, but also opens a new door to
developing new technologies.

Remark 5. While this simple image recognition example
illustrates the feasibility of introducing information
coupling to data analysis problems, there are critical
challenges for future research:

• How to develop efficient iterative algorithms that
exploit the structure of the graphical models to compute
the singular vectors and evaluate the scores.

• In the case where some training data are available,
how the information coupling approach can be
adjusted to cooperate with the side information.

• Beyond the most informative bit, how we can
extract the second and third bits from the data, and
how these bits can be applied to practical
data analysis tasks.

V. Conclusions 

Statistical analysis on big data has usually been treated
as an exercise in statistical data processing. With the
help of statistical communication theory, we have
introduced a new methodology to enable information-centric
processing (or statistical information processing) for big
data. Hopefully, this offers new insights into both big
data analytics and statistical communication theory.

Although we have demonstrated the initial feasibility of
this methodology, there are further critical
challenges ahead, namely

• How to identify appropriate or enough variables to
influence one variable (or a set of variables).

• How to detect outliers.

• How to generalize big data analytics using large
communication network analysis beyond multiuser
communications.

• How to interpret and adopt traditional machine
learning approaches and data processing technologies,
such as (un)supervised learning, feature selection,
and blind source separation, via the techniques
developed in network communication theories.

Appendix A 

Equalizer Implementation for Communication 
Theoretic Data Analytics 

As in Fig. 4, by proper selection of the adaptive algorithm
and step size, the output of the equalizer after the training
period gives the inference

$$\hat{y}[n] = \sum_{l=0}^{L} w_l x[n-l]. \qquad (11)$$

Based on the minimum MSE criterion, the purpose of
training data is to obtain

$$\arg\min_{\{w_l\}} E \left| y[n] - \sum_{l=0}^{L} w_l x[n-l] \right|^2,$$

$$\arg\min_{L} E \left| y[n] - \sum_{l=0}^{L} w_l x[n-l] \right|^2. \qquad (12)$$

The first equation is to obtain the vector of weighting
coefficients, and the second equation is to identify the
most appropriate observation depth, $L$. Once we identify
$L$, we keep it, and therefore the equalizer structure, to
infer data. We may keep the same set of coefficients or
update them online. Note that we may also obtain a predictor
as follows:

$$\hat{y}[n+1] = \sum_{l=0}^{L} w_l x[n-l], \qquad (13)$$

where we cannot go into further detail due to the length
constraint on this paper.

When we have two (or more) data variables to infer
another data variable, say using $X_1$ and $X_2$ to infer $Y$,
we have to use information fusion as in Figure 12. Again,
we adopt the minimum MSE criterion, to yield

$$\min E \left| y[n] - \sum_{m=1}^{M} \alpha_m[n] \hat{y}_m[n] \right|^2, \qquad (14)$$

where

$$\hat{y}_m[n] = \sum_{l=0}^{L_m} w_{l,m} x_m[n-l].$$

Fig. 12. Maximal-Ratio Combining of Two Equalized Data Variables
(Time Series).

Necessary conditions for this minimization give the
solution for $\alpha_m[n]$. The consequent estimator is therefore

$$\hat{y}[n] = \sum_{m=1}^{M} \alpha_m[n] \hat{y}_m[n]
= \sum_{m=1}^{M} \alpha_m[n] \sum_{l=0}^{L_m} w_{l,m} x_m[n-l], \qquad (15)$$

which is defined as the maximal ratio combining of
equalized multivariate regression of different optimal
observation lengths $L_m$, $m = 1, \ldots, M$. This design
realizes the idea of maximizing information flow between
data variables or time series. For ease of implementation,
we may set $\alpha_m[n] = \alpha_m$, or we may adopt selective
combining and equal-gain combining.
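
A minimal sketch of the single-variable equalizer of (11), trained under the minimum MSE criterion of (12), is given below. Several choices are assumptions made for illustration and are not specified in the text: the least-mean-squares (LMS) update as the adaptive algorithm, the step size `mu`, a held-out segment for choosing the observation depth $L$, and the synthetic relation between `x` and `y`.

```python
import random

def lms_train(x, y, L, mu=0.05):
    """Train the (L+1)-tap equalizer of (11) with the LMS update."""
    w = [0.0] * (L + 1)
    for n in range(L, len(x)):
        taps = [x[n - l] for l in range(L + 1)]
        err = y[n] - sum(wl * xl for wl, xl in zip(w, taps))
        w = [wl + mu * err * xl for wl, xl in zip(w, taps)]
    return w

def mse(x, y, w):
    """Mean squared error of the inference y_hat[n] = sum_l w[l] x[n-l]."""
    L = len(w) - 1
    errs = [(y[n] - sum(w[l] * x[n - l] for l in range(L + 1))) ** 2
            for n in range(L, len(x))]
    return sum(errs) / len(errs)

# Synthetic data (hypothetical): y depends on the current and the
# previous sample of x, so the appropriate depth is at least 1.
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
y = [0.5 * x[n] + (0.3 * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

x_tr, y_tr = x[:3000], y[:3000]  # training period
x_va, y_va = x[3000:], y[3000:]  # held-out segment for choosing L

# First line of (12): weights for each candidate depth.
# Second line of (12): depth with the smallest held-out MSE.
results = {L: mse(x_va, y_va, lms_train(x_tr, y_tr, L)) for L in range(4)}
best_L = min(results, key=results.get)
w1 = lms_train(x_tr, y_tr, 1)
print(best_L, w1)
```

With noiseless synthetic data, every depth $L \ge 1$ fits equally well, so in practice one would likely prefer the smallest depth whose error is within a tolerance of the minimum.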

Remark 6. A conjecture to explain why we
equalize data of a certain length $L_m$, instead of the
entire data set, is that earlier components in the time
series may introduce very noisy information, like
interference or noise in multiuser communication systems,
or simply weakly correlated information after a large
time separation. Such lengths $L_m$, $m = 1, \ldots, M$,
for data variables $X_1, \ldots, X_M$, represent the span/range
of useful data for inference. Of course, based on the
MSE, we may further select useful data variables among
$X_1, \ldots, X_M$. Similar concepts are not rare in machine
learning, for example, identifying support vectors in
support vector machines (SVMs). What we are doing
here is a more effective implementation that properly
selects data variables, the range of observations, and finally
the weighting coefficients in each equalizer, for multivariate
regression leveraging the optimization of information
transfer between relational data variables.